Zainab Ali

Our dataset is the red wine dataset, which contains 12 variables and 1599 observations. It includes 11 variables on the chemical properties of the wine and one outcome variable, quality, which rates each wine from 0 (very bad) to 10 (excellent). The loaded data frame shows 13 columns because it also carries a row index, X. The aim of this exploration is to answer one question: which chemical properties influence the quality of red wines?

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Since I have only one categorical variable (quality), let’s derive two more from existing variables. The first, sweetness_range, takes two values, high and low, and is derived from residual.sugar. The mean is the cut-off: a value >= mean(residual.sugar) is labeled high, and a value below mean(residual.sugar) is labeled low. The second, alcohol_range, follows the same idea: a value >= mean(alcohol) is labeled high, and a value below mean(alcohol) is labeled low.

# create a new categorical variable called sweetness_range
sugar_mean <- mean(redw$residual.sugar)
redw$sweetness_range <- ifelse(redw$residual.sugar >= sugar_mean, "high", "low")
# create a new categorical variable called alcohol_range
alcohol_mean <- mean(redw$alcohol)
redw$alcohol_range <- ifelse(redw$alcohol >= alcohol_mean, "high", "low")

Univariate Plots Section

Let’s start by exploring the distributions of the individual variables.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

No wine in our data set is rated 9 or 10, and none is rated below 3. Most wines have a medium rating (5 or 6), followed by 7.
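A count table like the one above could be produced with base R's table(). The sketch below is only an illustration: toy_quality is a made-up vector, not the wine data. Fixing the factor levels to the full 0 to 10 scale makes ratings that never occur show up as explicit zeros.

```r
# Toy ratings standing in for redw$quality (made-up values, not the real data)
toy_quality <- c(5, 6, 5, 7, 3, 6, 5, 8, 6, 4)

# Plain table(): only the ratings that actually occur are listed
observed_counts <- table(toy_quality)

# Fixing the levels to the full 0-10 scale shows unused ratings as zeros
full_scale_counts <- table(factor(toy_quality, levels = 0:10))

print(observed_counts)
print(full_scale_counts)
```

On the real data the same idea would apply to redw$quality, assuming the data frame is loaded as redw as in the code earlier in this report.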

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity peaks between about 7.0 and 8.0, with a median of 7.9.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH distribution is close to normal, peaking around 3.3, with some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The peak is at 0.0 g of citric acid, and there is an outlier equal to 1.0 g.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol is skewed to the right, with the peak between about 9.0 and 9.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The density distribution is close to a bell-shaped normal distribution, with a mean of 0.9967.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar has a long-tailed distribution, so I transformed it using log10. The observations concentrate between about 2.0 and 2.5, and there are outliers reaching about 15.0.
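To see why a log10 transformation helps with a long right tail, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: toy_sugar is drawn from a log-normal distribution and is not the real residual.sugar column, and the choice of 30 bins is arbitrary.

```r
set.seed(42)  # reproducible toy sample

# Made-up right-skewed values standing in for redw$residual.sugar
toy_sugar <- rlnorm(1599, meanlog = log(2.2), sdlog = 0.4)

# A long right tail pulls the mean above the median
raw_gap <- mean(toy_sugar) - median(toy_sugar)

# On the log10 scale the tail is compressed, so mean and median move closer
log_sugar <- log10(toy_sugar)
log_gap <- mean(log_sugar) - median(log_sugar)

print(c(raw_gap = raw_gap, log_gap = log_gap))

# plot = FALSE returns the bin counts without opening a graphics device;
# drop it to draw the histogram
log_hist <- hist(log_sugar, breaks = 30, plot = FALSE)
print(log_hist$counts)
```

The same transformation applies to the other long-tailed variables in this section; only the column passed to log10() changes.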

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates has a right-skewed distribution, so I transformed it using log10. It peaks around 0.7, and there are outliers (around 2.0).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides also has a long tail, so I transformed it using log10. It peaks at around 0.07.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The mean of volatile.acidity is around 0.5.

## 
## high  low 
##  435 1164

As we can see, most of our samples fall in the low sweetness range.

## 
## high  low 
##  683  916

The low alcohol range has a higher count than the high alcohol range.

Univariate Analysis

What is the structure of your dataset?

The red wine data set contains 1599 samples with 12 variables, 11 of which are chemical properties of the wine. The quality variable is a rating from 0 (very bad) to 10 (excellent).

Other observations:
- Most wines have a medium quality rating (5-6)
- The median pH is about 3.3
- The most common citric acid value is 0
- Around 75% of wines have residual sugar of 2.6 or less

What is/are the main feature(s) of interest in your dataset?

The main feature of this data set that I’m interested in is quality, and how the chemical properties influence it. What makes a wine get a high quality rating?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think all other features and variables will support my investigation about wine quality.

Did you create any new variables from existing variables in the dataset?

Yes. The first, sweetness_range, was derived from the residual.sugar variable, and the second, alcohol_range, was derived from the alcohol variable.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I found several right-skewed, long-tailed distributions (such as residual sugar and chlorides), and I adjusted the bin width and applied a log10 transformation to get better visualizations.

Bivariate Plots Section

It is time to explore relationships between pairs of variables.

After a general look using the scatterplot matrix, let’s explore the relationships between our variables in more detail.

Is there a relationship between alcohol and quality of wine?

## redw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## redw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## redw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## redw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## redw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## redw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

There appears to be a positive relationship: as alcohol increases, the quality of the wine increases. The median alcohol rises from 9.7 at quality 5 to 12.15 at quality 8.
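The per-quality summaries above have the layout that base R's by() prints, so a breakdown of this kind could be sketched as follows. The data frame toy_wine is hypothetical and its values are made up; it only stands in for redw.

```r
# Toy data frame standing in for redw (made-up values, not the wine data)
toy_wine <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.6, 9.8, 10.2, 10.5, 10.8, 11.2, 11.5, 11.8)
)

# Min, quartiles, median, mean and max of alcohol for each quality level
alcohol_by_quality <- by(toy_wine$alcohol, toy_wine$quality, summary)
print(alcohol_by_quality)

# Compact alternative: only the median alcohol per quality level
median_alcohol <- tapply(toy_wine$alcohol, toy_wine$quality, median)
print(median_alcohol)
```

Comparing the medians across levels is a quick way to read the direction of a relationship without a plot. The same call works for volatile.acidity or chlorides by swapping the first argument.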

Let’s explore another variable: volatile.acidity vs. quality.

## redw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## redw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## redw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## redw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## redw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## redw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

There seems to be a negative relationship: as volatile acidity decreases, quality increases. The median falls from 0.845 at quality 3 to 0.37 at quality 7 and 8.

Now it is the turn of the third variable: chlorides vs. quality.

## redw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## redw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## redw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## redw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## redw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## redw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

Lower chlorides go with higher quality: the median falls from about 0.09 at quality 3 to about 0.07 at quality 8. The relationship is negative: as chlorides decrease, quality increases.

Let’s look at how alcohol and residual.sugar influence quality, using our new categorical variables. The first set of counts below is for alcohol_range and the second is for sweetness_range.

## 
## high  low 
##  683  916
## redw$quality: 3
## 
## high  low 
##    3    7 
## -------------------------------------------------------- 
## redw$quality: 4
## 
## high  low 
##   20   33 
## -------------------------------------------------------- 
## redw$quality: 5
## 
## high  low 
##  137  544 
## -------------------------------------------------------- 
## redw$quality: 6
## 
## high  low 
##  335  303 
## -------------------------------------------------------- 
## redw$quality: 7
## 
## high  low 
##  172   27 
## -------------------------------------------------------- 
## redw$quality: 8
## 
## high  low 
##   16    2
## 
## high  low 
##  435 1164
## redw$quality: 3
## 
## high  low 
##    3    7 
## -------------------------------------------------------- 
## redw$quality: 4
## 
## high  low 
##   15   38 
## -------------------------------------------------------- 
## redw$quality: 5
## 
## high  low 
##  193  488 
## -------------------------------------------------------- 
## redw$quality: 6
## 
## high  low 
##  156  482 
## -------------------------------------------------------- 
## redw$quality: 7
## 
## high  low 
##   62  137 
## -------------------------------------------------------- 
## redw$quality: 8
## 
## high  low 
##    6   12

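This is consistent with what we saw in the scatter plot of alcohol vs. quality: there is a positive relationship, and quality increases as alcohol increases. The share of wines in the high alcohol range rises from about 20% at quality 5 to about 86% at quality 7 and 89% at quality 8. For residual sugar the direction is the opposite (quality increases as sweetness decreases), but the pattern in these counts is weak: the high sweetness share stays between roughly 24% and 33% at every quality level.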
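Raw counts are hard to compare because the quality levels differ so much in size, so row proportions are more informative. The sketch below re-enters the counts printed above by hand (the matrix names are my own) and uses base R's prop.table() to turn each row into shares.

```r
# Counts copied from the alcohol_range output above (rows: quality 3 to 8)
alcohol_counts <- matrix(
  c(3, 7,
    20, 33,
    137, 544,
    335, 303,
    172, 27,
    16, 2),
  ncol = 2, byrow = TRUE,
  dimnames = list(quality = as.character(3:8),
                  alcohol_range = c("high", "low"))
)

# Counts copied from the sweetness_range output above
sweetness_counts <- matrix(
  c(3, 7,
    15, 38,
    193, 488,
    156, 482,
    62, 137,
    6, 12),
  ncol = 2, byrow = TRUE,
  dimnames = list(quality = as.character(3:8),
                  sweetness_range = c("high", "low"))
)

# margin = 1 converts each row into proportions that sum to 1
alcohol_share <- prop.table(alcohol_counts, margin = 1)
sweetness_share <- prop.table(sweetness_counts, margin = 1)

# Share of "high" wines at each quality level
print(round(alcohol_share[, "high"], 3))
print(round(sweetness_share[, "high"], 3))
```

With the full data frame available, the count matrices could presumably be built directly with table(redw$quality, redw$alcohol_range) instead of being typed in.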

Let’s now explore relationships among some supporting variables, setting aside our outcome variable (quality).

The scatter plot suggests a negative relationship between fixed.acidity and pH: as pH decreases, fixed acidity increases.

There is a positive relationship between fixed.acidity and citric.acid.

There seems to be no clear relationship between chlorides and citric.acid: chlorides stay roughly constant as citric acid increases.

Most of our samples have chlorides around 0.1 and residual sugar around 2.0.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

During my investigation I found both positive and negative relationships. Quality has a positive relationship with alcohol. However, quality has a negative relationship with volatile acidity, chlorides, and (more weakly) residual sugar.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I plotted many variables against each other, but most plots showed no clear relationship. Two did: there is a negative relationship between fixed acidity and pH, and a positive relationship between fixed acidity and citric acid.

What was the strongest relationship you found?

Alcohol is positively correlated with the quality rating: as one increases, so does the other. Two different plot types showed this.
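The strength of a relationship like this can be put into a number with a correlation coefficient. The following is a minimal sketch on made-up values (toy_wine is hypothetical, not the wine data), so the printed coefficients say nothing about the real dataset. It computes both Pearson and Spearman coefficients with base R's cor(); Spearman is included on the assumption that quality is best treated as an ordinal rating.

```r
# Toy data standing in for redw (made-up values, not the wine data)
toy_wine <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7, 8, 5, 6, 7),
  alcohol = c(9.4, 9.8, 10.1, 10.9, 11.4, 12.0, 12.6, 9.5, 10.5, 11.0)
)

# Pearson measures linear association
pearson_r <- cor(toy_wine$alcohol, toy_wine$quality)

# Spearman works on ranks, which suits an ordinal rating such as quality
spearman_rho <- cor(toy_wine$alcohol, toy_wine$quality, method = "spearman")

print(round(c(pearson = pearson_r, spearman = spearman_rho), 3))
```

Running the same two calls on each chemical property against quality would give a ranking of the candidate relationships by strength.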

Multivariate Plots Section

Let’s now explore three or more variables against each other.

The majority of our samples fall in the low sweetness range. The high quality ratings (7-8) appear mostly in the high alcohol range with low residual sugar.

Let’s look at quality vs. pH, split by alcohol range and sweetness range.

Wines with a high quality rating, high alcohol range and low sweetness range have a pH around 3.2 to 3.3, while wines with a lower quality rating have a pH around 3.3 to 3.5.

Let’s look at quality vs. citric acid, split by alcohol range and sweetness range.

The lowest quality rating (3) has the lowest citric acid, almost 0.0, while the highest has citric acid around 0.4. The majority of wines rated 8 have high alcohol and low residual sugar, with around 0.4 to 0.5 citric acid.
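A numeric counterpart to these faceted plots is a grouped summary over all three grouping variables. The sketch below uses base R's aggregate() on a small hypothetical data frame (toy_wine, with made-up values) to compute the median citric acid for each combination of quality, alcohol_range and sweetness_range.

```r
# Toy data standing in for redw (made-up values, not the wine data)
toy_wine <- data.frame(
  quality = c(3, 3, 5, 5, 5, 5, 8, 8),
  alcohol_range = c("low", "low", "low", "low",
                    "high", "high", "high", "high"),
  sweetness_range = c("low", "high", "low", "high",
                      "low", "high", "low", "low"),
  citric.acid = c(0.00, 0.02, 0.20, 0.24, 0.30, 0.26, 0.42, 0.48)
)

# Median citric acid for every observed combination of the three groups
median_citric <- aggregate(
  citric.acid ~ quality + alcohol_range + sweetness_range,
  data = toy_wine,
  FUN = median
)

# Sort by quality so low and high ratings are easy to compare
median_citric <- median_citric[order(median_citric$quality), ]
print(median_citric)
```

Only combinations that occur in the data get a row, so sparse cells (for example, very few wines rated 3 or 8) are easy to spot before reading too much into them.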

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Wines with a high quality rating in most cases have high alcohol and low residual sugar. In addition, they have a pH around 3.2 to 3.3 and 0.4 to 0.5 g of citric acid.

Were there any interesting or surprising interactions between features?

Wines with quality rating 3 have citric acid of almost 0.0 in both categories of alcohol_range and sweetness_range.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

No


Final Plots and Summary

Plot One

Description One

The most significant feature of the plot is that high quality ratings seem very hard to get: the majority of our samples have a medium rating (5-6). We also notice that no wine is rated 9 or 10.

Plot Two

Description Two

The most significant features of the plots are that quality increases as alcohol increases, and quality increases as residual sugar decreases.
- quality vs. alcohol: positive relationship
- quality vs. residual sugar: negative relationship

Plot Three

Description Three

The most significant feature of the plots is that the lowest quality rating in our data (3) has citric acid of almost 0.0, and this holds for both categorical variables, alcohol_range and sweetness_range, while the high quality ratings have around 0.4 to 0.5 citric acid.


Reflection

The red wine dataset contains 12 variables and 1599 observations. It includes 11 variables on the chemical properties of the wine and one categorical variable. I met no serious problems during my exploration, although I struggled somewhat to understand the chemical terms. Since this dataset contains only one categorical variable, I derived two more from existing variables to get better plots. My main aim was to understand the relationships between quality and the other variables, and what makes a wine get a high quality rating. First of all, it looks like most wines got a medium rating, and it seems very hard to get a high one. I discovered a positive relationship between quality and alcohol, and negative relationships with volatile acidity, chlorides, and residual sugar. These relationships could be used to build a model that predicts wine quality.